
    Finding viable seed URLs for web corpora: A scouting approach and comparative study of available sources

    The conventional tools of the "web as corpus" framework rely heavily on URLs obtained from search engines. Recently, the corresponding querying process has become much slower or impossible to perform on a low budget. I try to find acceptable substitutes, i.e. viable link sources for web corpus construction. To this end, I perform a study of possible alternatives, including social networks as well as the Open Directory Project and Wikipedia. Four different languages (Dutch, French, Indonesian and Swedish) taken as examples show that complementary approaches are needed. My scouting approach using open-source software leads to a URL directory enriched with metadata which may be used to start a web crawl. This is more than a drop-in replacement for existing tools, since said metadata enables researchers to filter and select URLs that fit particular needs: they are classified according to their language, their length and a few other indicators such as host- and markup-based data.
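    A minimal sketch of the kind of URL scouting described above, assuming simple length- and markup-based indicators; the function name, thresholds and heuristics are illustrative and do not reproduce the actual toolchain.

```python
# Illustrative URL prequalification for web corpus seeds (assumed heuristics).
from urllib.parse import urlparse
from urllib.request import urlopen

def scout(url, min_length=1000):
    """Fetch a candidate URL and record simple metadata about it."""
    with urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    record = {
        "url": url,
        "host": urlparse(url).netloc,          # host-based indicator
        "length": len(html),                   # length-based indicator
        "has_markup": "<p" in html.lower(),    # crude markup-based indicator
    }
    # keep only pages that look like genuine text documents
    record["viable"] = record["length"] >= min_length and record["has_markup"]
    return record

if __name__ == "__main__":
    print(scout("https://www.wikipedia.org/"))
```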

    La Raison aveugle ? L'époque cybernétique et ses dispositifs

    Programme of the study day: http://calenda.org/220954
    Martin Heidegger's claim (in 1966) that cybernetics would henceforth take the place of philosophy sets the tone for the pessimistic vision of a society dominated by technology. As for its strangeness, technical modernity is frequently experienced under the sign of acceleration, increase and the impoverishment of lived experience, and it is diagnosed, following Heidegger, as a withdrawal of the human in the face of a rationality geared towards the progress of uniformisation and functionality as well as towards the mathematical pursuit of efficiency. With Gilbert Hottois, one can see a filiation between this operative techno-logy (together with the discourses it implies) and the calculability of signs in Leibniz. The criterion of truth in the sense of this ars characteristica is understood in terms of logical veracity and is detached from any interpretation, which opens the way to a combinatorial reason said to be "blind" ("cognitio caeca vel symbolica"). The import of Leibnizian mechanics for modern technology, and more precisely for computer systems, is well known, as is the primacy of the field of the visible in philosophy, from the very term "idea" to the association of mind and light, for example. It therefore seems timely to offer a critique of technology conceived as a blind Reason that misjudges the import of signs. The concretisation of Reason in the form of a machine and the shaping of the human on this model (for Foucault), the reign of cybernetics understood as the science of the systematised government of living beings (for Heidegger), and the technosciences (for Henry) are so many entry points into the critique of these logics and apparatuses.

    Two comparable corpora of German newspaper text gathered on the web: Bild & Die Zeit: Technical report

    This technical report documents the creation of two comparable corpora of German newspaper text, focused on the daily tabloid Bild and the weekly newspaper Die Zeit. Two specialized crawlers and corpus builders were designed in order to crawl the domain names bild.de and zeit.de with the objective of gathering as many complete articles as possible. High content quality was made possible by specially designed boilerplate removal and metadata recording code. As a result, two separate corpora were created. Currently, the last version for Bild is from 2011 and the last version for Die Zeit is from early 2013. The corpora comprise 60 476 and 134 222 articles respectively. Whereas the crawler designed for Bild has been discontinued due to frequent layout changes on the website, the one for Die Zeit is still actively maintained and its code has been made available under an open-source license.
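    The original crawlers are not reproduced here; the following is a minimal stand-in showing the idea of downloading an article page and stripping boilerplate, assuming the third-party trafilatura package is available.

```python
# Minimal stand-in for article fetching and boilerplate removal
# (not the original Bild/Zeit crawlers). Requires: pip install trafilatura
import trafilatura

def get_article_text(url):
    """Download one article page and return its main text without boilerplate."""
    downloaded = trafilatura.fetch_url(url)   # raw HTML, or None on failure
    if downloaded is None:
        return None
    return trafilatura.extract(downloaded)    # main content extraction

if __name__ == "__main__":
    text = get_article_text("https://www.zeit.de/index")
    print(text[:500] if text else "no extractable content")
```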

    A one-pass valency-oriented chunker for German

    Non-finite state parsers provide fine-grained information. However, they are computationally demanding. Therefore, it is interesting to see how far a shallow parsing approach is able to go. The transducer described here performs a pattern-based matching operation over POS tags, using regular expressions that take advantage of the characteristics of German grammar. The process aims at finding linguistically relevant phrases with good precision, which in turn enables an estimation of the actual valency of a given verb. The chunker reads its input exactly once instead of using cascades, which greatly benefits computational efficiency. This finite-state chunking approach does not return a tree structure, but rather yields various kinds of linguistic information useful to the language researcher. Possible applications include simulation of text comprehension on the syntactical level, creation of selective benchmarks and failure analysis.
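    A toy sketch of single-pass chunking with regular expressions over POS tags; the tagset, the noun-phrase pattern and the example sentence are assumptions for illustration, not the transducer described in the paper.

```python
# One-pass chunking over a POS-tag string with a regular expression
# (toy noun-phrase pattern over STTS-like tags; illustrative only).
import re

# a toy tagged sentence: (token, tag) pairs
tagged = [("der", "ART"), ("kleine", "ADJA"), ("Hund", "NN"),
          ("schläft", "VVFIN"), ("im", "APPRART"), ("Garten", "NN")]
tags = " ".join(tag for _, tag in tagged)

# noun phrase: optional article, any number of adjectives, then a noun
NP = re.compile(r"\b(?:ART )?(?:ADJA )*NN\b")

# a single left-to-right pass over the tag string yields the chunk spans
for match in NP.finditer(tags):
    start = tags[:match.start()].count(" ")
    end = start + match.group().count(" ") + 1
    print("NP chunk:", [tok for tok, _ in tagged[start:end]])
```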

    Challenges in web corpus construction for low-resource languages in a post-BootCaT world

    Software available under an open-source license: FLUX (Filtering and Language-identification for URL Crawling Seeds), https://github.com/adbar/flux-toolchain
    The state-of-the-art tools of the "web as corpus" framework rely heavily on URLs obtained from search engines. Recently, this querying process has become very slow or impossible to perform on a low budget. In order to find reliable data sources for Indonesian, I perform a case study of different kinds of URL sources and crawling strategies. First, I classify URLs extracted from the Open Directory Project and Wikipedia for Indonesian, Malay, Danish and Swedish in order to enable comparisons. Then I perform web crawls focusing on Indonesian and using the mentioned sources as start URLs. My scouting approach using open-source software results in a URL database with metadata which can be used to replace, or at least to complement, the BootCaT approach.
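    A small sketch of seed-list cleaning before a crawl, in the spirit of the filtering step mentioned above; the blocklist entries and the per-host cap are assumptions and do not reproduce the FLUX toolchain.

```python
# Seed URL cleaning: host-based deduplication and a simple blocklist
# (assumed rules, for illustration only).
from urllib.parse import urlparse

BLOCKED_HOST_PARTS = ("doubleclick", "adserver")   # assumed examples

def clean_seeds(urls, max_per_host=5):
    """Return a deduplicated, host-balanced list of start URLs."""
    seen_per_host = {}
    kept = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if not host or any(part in host for part in BLOCKED_HOST_PARTS):
            continue
        count = seen_per_host.get(host, 0)
        if count < max_per_host:     # avoid over-representing a single site
            kept.append(url)
            seen_per_host[host] = count + 1
    return kept

print(clean_seeds(["http://example.org/a", "http://example.org/b",
                   "http://adserver.example.com/x"]))
```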

    Challenges in the linguistic exploitation of specialized republishable web corpora

    Short paper talk at the RESAW 2015 conference (Aarhus, Denmark).
    I would like to present work on text corpora in German, gathered on the Web and processed in order to be made available to linguists and a broader user community via a web interface. The corpora are specialized in the sense that they only address a particular text genre or source at a time. Web crawling techniques are used to download the documents, which are then stored roughly in the way web archives do. More precisely, I would like to talk about two cases where texts are expected to be republishable: a "standard" case, political speeches, and a "borderline" case, German blogs under CC license.
    The work is performed in the context of a digital dictionary of German. The primary user base consists of lexicographers, who need valuable or at least exploitable evidence, in the form of precise quotes or definition elements. The actual gathering and processing of the corpora is described elsewhere (anonymized references). In this talk I would like to focus on a series of challenges that have to be solved in order to make data from web archives accessible to researchers and to study web text corpora: metadata extraction, quality assurance, licensing, and "scientificity".
    1. Proper metadata extraction is needed in order to make further downstream applications possible. It has to be performed meticulously, since experience shows that even small or rare mistakes in date encoding, for instance, may cause the application to be disregarded or discarded by researchers in the humanities, because linguistic trends cannot be identified properly if the content is not ordered in time. The easily available metadata in the case of speeches contrasts with the varying content types, encodings and markup patterns of the blogs. Compromises have to be made without sacrificing recall, since republishable texts are rather rare.
    2. Regarding the content, quality assurance is paramount, since users expect high quality, all the more since they may feel reluctant to use web texts for their studies. In fact, providing "Hi-Fi" web corpora also means promoting the cause of web sources and the modernization of research methodology.
    3. The results are hosted in Germany, and thus German copyright law applies, which can be considered more restrictive than others. Additionally, there are a number of issues with licensing in general and CC licenses in particular, even with manual verification: the CC ND and (to a lesser extent) NC clauses can hinder proper republication. There are also potential copyright issues regarding blog comments (a minimal license check is sketched below).
    To sum up the issues described above, much work goes into ensuring the "scientificity" of web texts and making them not only available but also citable in a scholarly sense.
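    As announced above, a minimal sketch of a republishability check on CC license labels; the labels and the policy (exclude ND, flag NC) are assumptions chosen for the example, not a statement about the actual corpus workflow.

```python
# Rough republishability check on Creative Commons license labels
# (assumed policy: ND blocks redistribution here, NC is allowed but flagged).
def republishable(license_string):
    """Return (allowed, note) for a CC license label."""
    lic = license_string.upper()
    if "ND" in lic:
        return False, "ND clause"
    if "NC" in lic:
        return True, "NC clause, check downstream use"
    return True, "ok"

for lic in ["CC BY-SA 3.0", "CC BY-NC 2.0", "CC BY-ND 3.0"]:
    print(lic, "->", republishable(lic))
```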

    Placenames analysis in historical texts: tools, risks and side effects

    This article presents an approach combining linguistic analysis, geographic information retrieval and visualization in order to go from toponym extraction in historical texts to projection on customizable maps. The toolkit is released under an open-source license; it features bootstrapping options, geocoding and disambiguation algorithms, as well as cartographic processing. The software setting is designed to be adaptable to various historical contexts: it can be extended with further automatically processed or user-curated gazetteers, used directly on texts or plugged into a larger processing pipeline. I provide an example of the issues raised by generic extraction and show the benefits of an integrated knowledge-based approach, data cleaning and filtering.
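    A toy sketch of gazetteer-based toponym extraction and geocoding as described above; the tiny gazetteer, its coordinates and the sample sentence are invented for illustration and are not part of the released toolkit.

```python
# Gazetteer lookup and geocoding of place names in a text (toy example).
import re

GAZETTEER = {                 # place name -> (latitude, longitude), assumed values
    "Wien": (48.21, 16.37),
    "Prag": (50.09, 14.42),
    "Lemberg": (49.84, 24.03),
}

def extract_toponyms(text):
    """Return (name, coordinates) pairs for known place names found in the text."""
    hits = []
    for name, coords in GAZETTEER.items():
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            hits.append((name, coords))
    return hits

print(extract_toponyms("Die Reise führte von Wien über Prag nach Lemberg."))
```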

    Efficient construction of metadata-enhanced web corpora

    Metadata extraction is known to be a problem in general-purpose Web corpora, and so is extensive crawling with little yield. The contributions of this paper are threefold: a method to find and download large numbers of WordPress pages; a targeted extraction of content featuring much-needed metadata; and an analysis of the documents in the corpus with insights into actual blog uses. The study focuses on a particular publishing platform (WordPress), which allows for reliable extraction of structural elements such as metadata, posts, and comments. The download of about 9 million documents in the course of two experiments leads, after processing, to 2.7 billion tokens with usable metadata. This comparatively high yield is a step towards more efficiency with respect to machine power and "Hi-Fi" web corpora. The resulting corpus complies with formal requirements on metadata-enhanced corpora and on weblogs considered as a series of dated entries. However, existing typologies of Web texts have to be revised in the light of this hybrid genre.
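    A rough sketch of how WordPress pages can be recognized before a targeted crawl; the two markers used here (generator meta tag, wp-content paths) are common heuristics assumed for the example and not the exact detection described in the paper.

```python
# Heuristic WordPress detection for targeted crawling (illustrative only).
import re
from urllib.request import urlopen

def looks_like_wordpress(url):
    """Heuristically decide whether a page is served by WordPress."""
    with urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    if re.search(r"<meta[^>]+generator[^>]+WordPress", html, re.I):
        return True
    return "/wp-content/" in html or "/wp-includes/" in html

if __name__ == "__main__":
    print(looks_like_wordpress("https://wordpress.org/news/"))
```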

    Extraction and Visualization of Toponyms in Diachronic Text Corpora

    This paper focuses on the extraction of German and Austrian place names in historical texts. Our text basis is Die Fackel (The Torch), published by Karl Kraus. The database we develop follows from a combination of approaches: gazetteers are curated in a supervised way to account for historical differences, and current geographical information is used as a fallback. Our maps highlight the linguistic and cultural ties of Kraus and his contemporaries: "Die Fackel" is (at least) a European phenomenon, and Kraus' vision of Europe is more inclined towards cultural centers.
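    A minimal sketch of the fallback logic described above (curated historical gazetteer first, current geographic data second); both dictionaries and their entries are invented for illustration.

```python
# Historical-first geocoding with a modern fallback (toy data, illustrative only).
HISTORICAL = {"Lemberg": (49.84, 24.03)}                      # curated, period-specific
MODERN = {"Lviv": (49.84, 24.03), "Wien": (48.21, 16.37)}     # current fallback

def geocode(name):
    """Prefer the historical gazetteer, fall back to current geographic data."""
    if name in HISTORICAL:
        return HISTORICAL[name], "historical"
    if name in MODERN:
        return MODERN[name], "modern fallback"
    return None, "unknown"

for place in ["Lemberg", "Wien", "Atlantis"]:
    print(place, geocode(place))
```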

    Construction de corpus généraux et spécialisés à partir du Web

    At the beginning of the first chapter the interdisciplinary setting between linguistics, corpus linguistics, and computational linguistics is introduced. Then, the notion of corpus is put into focus. Existing corpus and text definitions are discussed. Several milestones of corpus design are presented, from pre-digital corpora at the end of the 1950s to web corpora in the 2000s and 2010s. The continuities and changes between the linguistic tradition and web-native corpora are laid out.
    In the second chapter, methodological insights on automated text scrutiny in computer science, computational linguistics and natural language processing are presented. The state of the art on text quality assessment and web text filtering exemplifies current interdisciplinary research trends on web texts. Readability studies and automated text classification are used as exemplary methods to find salient features in order to grasp text characteristics. Text visualization exemplifies corpus processing in the digital humanities framework. As a conclusion, guiding principles for research practice are listed, and reasons are given to find a balance between quantitative analysis and corpus linguistics, in an environment shaped by technological innovation and artificial intelligence techniques.
    Third, current research on web corpora is summarized. I distinguish two main approaches to web document retrieval: restricted retrieval and web crawling. The notion of web corpus preprocessing is introduced and salient steps are discussed. The impact of the preprocessing phase on research results is assessed. I explain why the importance of preprocessing should not be underestimated and why it is important for linguists to learn new skills in order to tackle the whole data gathering and preprocessing phase.
    I present my work on web corpus construction in the fourth chapter. My analyses concern two main aspects: first, the question of corpus sources (or prequalification), and secondly the problem of including valid, desirable documents in a corpus (or document qualification). Last, I present work on corpus visualization, consisting of extracting certain corpus characteristics in order to give indications on corpus contents and quality.
    The first chapter opens with a description of the interdisciplinary context. Then the concept of corpus is presented with regard to the state of the art. The need for evidence that is linguistic in nature yet embraces different disciplines is illustrated by several research scenarios. Several key stages of corpus construction are retraced, from corpora preceding the digital era at the end of the 1950s to the web corpora of the 2000s and 2010s. The continuities and changes between the linguistic tradition and corpora drawn from the web are laid out.
    The second chapter gathers methodological considerations. The state of the art concerning text quality assessment is described. Then the methods used in readability studies as well as in automated text classification are summarized, and common denominators are isolated. Finally, text visualization demonstrates the value of corpus analysis for the digital humanities. The reasons for striking a balance between quantitative analysis and corpus linguistics are addressed.
    The third chapter summarizes the contribution of the thesis to research on web corpora. The question of data collection is examined with particular attention, especially the case of source URLs. The notion of web corpus preprocessing is introduced and its major steps are outlined. The impact of preprocessing on the results is assessed. The question of the simplicity and reproducibility of corpus construction is brought to the fore.
    The fourth part describes the contribution of the thesis to corpus construction proper, through the question of sources and the problem of invalid or undesirable documents. An approach using a lightweight scout to prepare the web crawl is presented. Then the work on document selection just before inclusion in a corpus is summarized: insights from readability studies as well as machine learning techniques can be used during corpus construction. A set of textual features tested on annotated samples assesses the efficiency of the process. Finally, the work on corpus visualization is addressed: features are extracted at corpus scale in order to give indications about its composition and quality.
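    A small sketch of document qualification based on simple textual features, in the spirit of the readability-inspired selection mentioned above; the feature set, thresholds and sample text are assumptions and do not reproduce the classifier of the thesis.

```python
# Document qualification with simple textual features (assumed thresholds).
def text_features(text):
    """Compute a few surface features of a document."""
    tokens = text.split()
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    return {
        "n_tokens": len(tokens),
        "avg_word_len": sum(len(t) for t in tokens) / max(len(tokens), 1),
        "avg_sent_len": len(tokens) / max(len(sentences), 1),
        "uppercase_ratio": sum(t.isupper() for t in tokens) / max(len(tokens), 1),
    }

def qualifies(text, min_tokens=100, max_upper=0.2):
    """Keep documents that are long enough and not dominated by shouting or boilerplate."""
    f = text_features(text)
    return f["n_tokens"] >= min_tokens and f["uppercase_ratio"] <= max_upper

sample = "Dies ist ein kurzer Beispieltext. " * 20
print(text_features(sample), qualifies(sample))
```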